language ideology
Do language models practice what they preach? Examining language ideologies about gendered language reform encoded in LLMs
Watson, Julia; Lee, Sophia; Beekhuizen, Barend; Stevenson, Suzanne
We study language ideologies in text produced by LLMs through a case study on English gendered language reform (related to role nouns like congressperson/-woman/-man, and singular they). First, we find political bias: when asked to use language that is "correct" or "natural", LLMs use language most similarly to when asked to align with conservative (vs. progressive) values. This shows how LLMs' metalinguistic preferences can implicitly communicate the language ideologies of a particular political group, even in seemingly non-political contexts. Second, we find LLMs exhibit internal inconsistency: LLMs use gender-neutral variants more often when more explicit metalinguistic context is provided. This shows how the language ideologies expressed in text produced by LLMs can vary, which may be unexpected to users. We discuss the broader implications of these findings for value alignment.
- North America > Canada > Ontario > Toronto (0.29)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Singapore (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Government (0.67)
- Media > News (0.46)
- Leisure & Entertainment (0.46)
- Law Enforcement & Public Safety (0.46)
Standard Language Ideology in AI-Generated Language
Smith, Genevieve; Fleisig, Eve; Bossi, Madeline; Rustagi, Ishita; Yin, Xavier
In this position paper, we explore standard language ideology in language generated by large language models (LLMs). First, we outline how standard language ideology is reflected and reinforced in LLMs. We then present a taxonomy of open problems regarding standard language ideology in AI-generated language with implications for minoritized language communities. We introduce the concept of standard AI-generated language ideology, the process by which AI-generated language regards Standard American English (SAE) as a linguistic default and reinforces a linguistic bias that SAE is the most "appropriate" language. Finally, we discuss tensions that remain, including reflecting on what desirable system behavior looks like, as well as advantages and drawbacks of generative AI tools imitating--or often not--different English language varieties. Throughout, we discuss standard language ideology as a manifestation of existing global power structures in and through AI-generated language before ending with questions to move towards alternative, more emancipatory digital futures.
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > California > Alameda County > Berkeley (0.05)
- Europe > United Kingdom > England > Greater London > London (0.05)
- Media (1.00)
- Health & Medicine (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.51)
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Gururangan, Suchin; Card, Dallas; Dreier, Sarah K.; Gade, Emily K.; Wang, Leroy Z.; Wang, Zeyu; Zettlemoyer, Luke; Smith, Noah A.
Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Personal (1.00)
- Research Report > Experimental Study (0.69)
- Media > News (1.00)
- Law (1.00)
- Government (1.00)